Statistical Programming in R

We use the following packages:

library(MASS)     # Datasets
library(mice)     # Boys dataset
library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(ggplot2)  # Plotting suite
library(sf)       # Spatial features

Visualising in R

  • R makes it very easy to visualise data
  • But fine-tuning figures to specific standards can take a lot of time

Why visualise?

  • We can process a lot of information quickly with our eyes
  • Plots are more intuitively accessible to non-specialists
  • Plots give us information about
    • Distribution / shape
    • Irregularities
    • Assumptions
    • Intuitions

Why visualise?

Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.
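
Anscombe's quartet ships with R as the built-in `anscombe` data frame, so the point of the paper can be checked directly. The sketch below is an illustration added here, not code from the paper: the four x/y pairs have nearly identical means and correlations, yet look completely different once plotted.

```r
# Anscombe's quartet: datasets::anscombe has columns x1..x4 and y1..y4
data(anscombe)

# Summary statistics per pair are almost indistinguishable
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
})
colnames(stats) <- paste0("set", 1:4)
round(stats, 2)

# The plots tell four different stories
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), bty = "L")
}
par(op)  # restore the previous plotting layout
```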

What we will do

  • A few plots in base graphics in R
  • Plotting with ggplot2 graphics
  • Plotting data on maps

Base graphics in R

Scatter plot

plot(x = boys$hgt, y = boys$wgt, main = "Scatter plot", 
     xlab = "Height", ylab = "Weight", bty = "L")

Line chart

plot(x = 1900+phones$year, y = phones$calls, type = "l", main = "Line chart",
     xlab = "Year", ylab = "Phone calls in Belgium, millions", bty = "L")

Bar chart

counts <- table(boys$reg)

barplot(counts, main="Bar chart", ylab = "N")

Pie chart

counts <- table(boys$reg)

pie(x=counts, main="Pie chart")

Histogram

hist(boys$hgt, main = "Histogram", xlab = "Height")

Density

dens <- density(boys$hgt, na.rm = TRUE)
plot(dens, main = "Density plot", xlab = "Height", bty = "L")

Box plot

boxplot(boys$hgt ~ boys$reg, main = "Boxplot", 
        xlab = "Region", ylab = "Height")

A lot can be done in base R!

boys %>% md.pattern() # from mice

##     age reg wgt hgt bmi hc gen phb  tv     
## 223   1   1   1   1   1  1   1   1   1    0
## 19    1   1   1   1   1  1   1   1   0    1
## 1     1   1   1   1   1  1   1   0   1    1
## 1     1   1   1   1   1  1   0   1   0    2
## 437   1   1   1   1   1  1   0   0   0    3
## 43    1   1   1   1   1  0   0   0   0    4
## 16    1   1   1   0   0  1   0   0   0    5
## 1     1   1   1   0   0  0   0   0   0    6
## 1     1   1   0   1   0  1   0   0   0    5
## 1     1   1   0   0   0  1   1   1   1    3
## 1     1   1   0   0   0  0   1   1   1    4
## 1     1   1   0   0   0  0   0   0   0    7
## 3     1   0   1   1   1  1   0   0   0    4
##       0   3   4  20  21 46 503 503 522 1622

Many R objects have a plot() method

result <- lm(age ~ wgt, data = boys)
plot(result, which = 1) # which = 1: residuals vs fitted values
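
Why does this work? `plot()` is a generic function: it looks at the class of its first argument and hands over to a matching method. A minimal sketch of that dispatch, reusing the model from this slide (the density object is added only for comparison):

```r
library(mice)  # provides the boys dataset

result <- lm(age ~ wgt, data = boys)
class(result)  # "lm", so plot(result) is handled by the plot method for lm

# Other classes bring their own methods, e.g. density objects
dens <- density(boys$hgt, na.rm = TRUE)
class(dens)    # "density"

# The methods themselves can be looked up by generic and class
plot_lm      <- getS3method("plot", "lm")
plot_density <- getS3method("plot", "density")
```

Running `methods(plot)` lists every plot method available in the current session.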

Neat! But what if we want more control?

ggplot2

What is ggplot2?

Layered plotting based on the book The Grammar of Graphics by Leland Wilkinson.

With ggplot2 you

  1. provide the data
  2. define how to map variables to aesthetics
  3. state which geometric object to display
  4. (optional) edit the overall theme of the plot

ggplot2 then takes care of the details

An example: scatterplot

1: Provide the data

boys %>%
  ggplot()

2: Map variables to aesthetics

boys %>%
  ggplot(aes(x = age, y = bmi))

3: State which geometric object to display

boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point()

Why this syntax?

Create the plot

gg <- 
  boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point(col = "dark green")

Add another layer (smooth fit line)

gg <- gg + 
  geom_smooth(col = "dark blue")

Give it some labels and a nice look

gg <- gg + 
  labs(x = "Age", y = "BMI", title = "BMI trend for boys") +
  theme_minimal()

Why this syntax?

plot(gg)

Aesthetics

  • x
  • y
  • size
  • colour
  • fill
  • opacity (alpha)
  • linetype
  • …
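
An aesthetic can either be mapped to a variable or set to a fixed value. The sketch below illustrates the difference on the boys data; the object names `p_mapped` and `p_set` are chosen here for illustration.

```r
library(mice)     # boys dataset
library(ggplot2)

# Mapped: colour varies with the data, so it goes inside aes() and gets a legend
p_mapped <- ggplot(boys, aes(x = age, y = bmi)) +
  geom_point(aes(colour = reg))

# Set: colour is one fixed value, so it goes outside aes() and gets no legend
p_set <- ggplot(boys, aes(x = age, y = bmi)) +
  geom_point(colour = "darkgreen")

p_mapped
p_set
```

Rows with a missing age or bmi are dropped with a warning, and boys without a recorded region show up as NA in the legend of the first plot.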

Aesthetics

gg <- 
  boys %>% 
  filter(!is.na(reg)) %>% 
  
  ggplot(aes(x      = age, 
             y      = bmi, 
             size   = hc, 
             colour = reg)) +
  
  geom_point(alpha = 0.5) +
  
  labs(title  = "BMI trend for boys",
       x      = "Age", 
       y      = "BMI", 
       size   = "Head circumference",
       colour = "Region") +
  theme_minimal()

Aesthetics

plot(gg)

Geoms

  • geom_point
  • geom_bar
  • geom_line
  • geom_smooth

  • geom_histogram
  • geom_boxplot
  • geom_density

Geoms: Bar
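
A minimal sketch of a bar chart of region counts with geom_bar(), mirroring the base-graphics barplot() example; the labels are illustrative choices.

```r
library(mice)     # boys dataset
library(dplyr)    # filter() and the pipe
library(ggplot2)

p_bar <- boys %>%
  filter(!is.na(reg)) %>%        # drop boys without a recorded region
  ggplot(aes(x = reg)) +
  geom_bar() +                   # counts the rows per region for us
  labs(x = "Region", y = "N", title = "Bar chart")

p_bar
```

Unlike barplot(), geom_bar() needs no table() call first: it counts the observations itself.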

Geoms: Line
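
A possible ggplot2 version of the phone calls line chart. `phones` from MASS is a list, so this sketch first puts it in a data frame (the name `phones_df` is an assumption made here).

```r
library(MASS)     # phones dataset
library(ggplot2)

# ggplot() wants a data frame; years are stored as 50, 51, ... so add 1900
phones_df <- data.frame(year  = 1900 + phones$year,
                        calls = phones$calls)

p_line <- ggplot(phones_df, aes(x = year, y = calls)) +
  geom_line() +
  labs(x = "Year", y = "Phone calls in Belgium, millions",
       title = "Line chart")

p_line
```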

Geoms: Smooth
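
A sketch of a smoothed trend line drawn over the raw points. Choosing loess explicitly is an assumption made here to keep the example predictable; `method = "lm"` would give a straight regression line instead.

```r
library(mice)     # boys dataset
library(ggplot2)

p_smooth <- ggplot(boys, aes(x = age, y = bmi)) +
  geom_point(alpha = 0.3) +                       # raw data, faded
  geom_smooth(method = "loess", formula = y ~ x,  # local regression fit
              colour = "darkblue") +
  labs(x = "Age", y = "BMI", title = "Smooth")

p_smooth
```

The grey band around the line is a confidence interval for the fit; `se = FALSE` switches it off.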

Geoms: Boxplot
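
A sketch of height by region with geom_boxplot(), the ggplot2 counterpart of the base boxplot() call shown earlier; labels are illustrative.

```r
library(mice)     # boys dataset
library(dplyr)    # filter() and the pipe
library(ggplot2)

p_box <- boys %>%
  filter(!is.na(reg)) %>%              # drop boys without a recorded region
  ggplot(aes(x = reg, y = hgt)) +      # one box per region
  geom_boxplot() +
  labs(x = "Region", y = "Height", title = "Boxplot")

p_box
```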

Geoms: Density
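
A sketch of the height distribution with geom_density(). The fill colour and transparency are arbitrary choices made for this example.

```r
library(mice)     # boys dataset
library(ggplot2)

p_dens <- ggplot(boys, aes(x = hgt)) +
  geom_density(fill = "grey80", alpha = 0.7) +  # kernel density estimate
  labs(x = "Height", y = "Density", title = "Density plot")

p_dens
```

In contrast to density() in base R, no `na.rm = TRUE` is needed: missing heights are removed with a warning.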

Helpful link in RStudio

Maps

Simple Features

  • A formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects.
  • Also implemented in other software, e.g. ArcGIS
  • Implemented for R in the sf package
  • Feature geometries are stored in a list-column of a data.frame

We have time for a cursory introduction at most.
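
To see what "geometries in a data.frame" means in practice, here is a small sketch that builds an sf object by hand. The three points and their coordinates are made up, and EPSG:25832 (ETRS89 / UTM zone 32N) is an assumed coordinate reference system chosen for illustration.

```r
library(sf)

# Hypothetical points; the coordinates are invented for this example
pts <- data.frame(
  name = c("A", "B", "C"),
  x    = c(500000, 600000, 700000),
  y    = c(6100000, 6200000, 6300000)
)

# Turn the x/y columns into a geometry column (assumed CRS: EPSG 25832)
pts_sf <- st_as_sf(pts, coords = c("x", "y"), crs = 25832)

class(pts_sf)        # an sf object is still a data.frame
st_geometry(pts_sf)  # the geometry column on its own

# Ordinary data.frame subsetting works and keeps the geometry attached
pts_sub <- pts_sf[pts_sf$name != "B", ]
pts_sub
```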

Reading in spatial data

denmark <- st_read("DK_map.shp")
## Reading layer `DK_map' from data source `C:\Users\tgw513\Documents\GitHub\Rbosnia\Contents\Material\Part I - Data visualization\DK_map.shp' using driver `ESRI Shapefile'
## Simple feature collection with 306 features and 6 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 441524.8 ymin: 6049785 xmax: 892800.8 ymax: 6402308
## epsg (SRID):    NA
## proj4string:    +proj=utm +zone=32 +ellps=GRS80 +units=m +no_defs
plot(st_geometry(denmark))

Plotting regional attributes

denmark$proportion.over.70 <- denmark$over70/denmark$population

plot(denmark["proportion.over.70"],
     main = "Proportion of population aged 70 years and above")

Or we can ggplot

denmark %>% ggplot(aes(fill=proportion.over.70)) + geom_sf()
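
The same pattern can be tried without DK_map.shp. The sketch below swaps in the North Carolina shapefile that ships with the sf package as a stand-in; the derived column `sid.rate`, the colour scale and the labels are illustrative choices, not part of the Denmark example.

```r
library(sf)
library(ggplot2)

# Stand-in data: the nc shapefile bundled with sf (not the Denmark map)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Derive a regional attribute, analogous to proportion.over.70
nc$sid.rate <- nc$SID74 / nc$BIR74

p_map <- ggplot(nc) +
  geom_sf(aes(fill = sid.rate)) +             # geom_sf() finds the geometry column itself
  scale_fill_viridis_c(name = "SIDS rate") +  # continuous colour scale
  labs(title = "SIDS deaths per live birth, North Carolina") +
  theme_minimal()

p_map
```

Because the result is an ordinary ggplot object, labels, scales and themes can be added to a map in the same way as to any other plot.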

Practical